PreSTA: Preventative Spatio-Temporal Aggregation


PROBLEM

• Traditional punishment mechanisms (e.g., blacklists) are reactive

SOLUTION

• PreSTA: Detect malicious users (e.g., spammers) before harm is done


• Malicious users are spatially clustered (in any dimension)
• Malicious users are likely to repeat bad behaviors (temporal)


Input: A historical record of those principals known to be bad, and the
timestamp of this observation (feedback)


Output: An extended list of principals who are thought to be bad now,
based on their past history, and history of those around them
TALK OUTLINE


PreSTA  Running  Example:  Spam  Detection 


•  Spatio-temporal  properties  of  spam  mail 

•  Basis  for  spatial  groupings 

•  Calculating  and  combining  reputations 

•  Classifier  performance 


Generalizing  PreSTA:  Additional  Use-Cases  for  Model 


•  Malicious  editors  on  Wikipedia 

•  Applicability  to  the  QuanTM  model 

•  General  PreSTA  use-case  criteria 


Conclusions  &  References 
SPAM: TEMPORAL PROPERTIES


TEMPORAL:  Bad  Guys  Repeat  Bad  Behaviors 


•  Spammers  want  to  maximize 
utilization  of  available  IP 
addresses,  leading  to  re-use 

•  Bot-nets  will  compromise  a 
machine  until  patched 

•  Blacklist  entries  have 
predictable  duration  (~6  days), 
making  for  trivial  recycling 

•  Most  mail  servers  have  static  IP  addresses,  so  IP  acts  as  a  persistent 
identifier  -  though  we  later  discuss  DHCP  considerations 
IP DELEGATION HIERARCHY


(1)  Internet  Assigned  Numbers  Auth.: 
Controls  all  IP  delegation  (root  of  trust) 

(2)  Regional  Internet  Registries: 
Continent-level  equivalent  of  the  IANA 

(3)  Autonomous  Systems  (ISPs): 
Broadcast  the  IPs  they  control  via  the 
Border  Gateway  Protocol  (BGP) 

(4) Local routers distribute addresses
from some pool (e.g., a /24). Such
subnet boundaries are NOT known


(5) Individual IP: Over time a single IP
may have multiple inhabitants (due to
dynamic nature - DHCP)
SPATIAL GROUPINGS


AS-level: AS(es), |1000's| IPs


Subnet-level (Block-Heuristic): |768| IPs

• What is /24 (256 IP) membership?

• Valuate that block and two adjacent

• Estimation of subnet membership


IP-level: |1| IP

• Simplest case. Little spatial value.

• Due to DHCP, may have multiple
inhabitants over time, though ...


11/4/2009, Penn Engineering, ONR-MURI Review
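
The block heuristic above (the /24 containing an address plus its two adjacent /24s, 768 IPs in total) can be sketched with Python's standard ipaddress module. The function name and the handling of the edges of the address space are illustrative assumptions, not details from the slides.

```python
import ipaddress


def block_heuristic(ip: str) -> list:
    """Return the /24 holding `ip` plus its two neighboring /24 blocks.

    Hypothetical helper: at the very start or end of the IPv4 space a
    neighbor does not exist, so fewer than three blocks are returned.
    """
    addr = ipaddress.IPv4Address(ip)
    base = int(addr) & 0xFFFFFF00  # zero the last octet: start of the /24
    blocks = []
    for offset in (-256, 0, 256):  # previous, own, and next /24
        start = base + offset
        if 0 <= start <= 0xFFFFFF00:
            blocks.append(ipaddress.IPv4Network((start, 24)))
    return blocks


if __name__ == "__main__":
    for block in block_heuristic("192.0.2.77"):
        print(block, block.num_addresses)
```

Since true subnet boundaries are not known, the three blocks only estimate subnet membership; a wider or narrower window would be an equally valid heuristic.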


SPAM:  SPATIAL  PROPERTIES 


SPATIAL:  Bad  Guys  Live  in  Close  Proximity  [3]  (IP) 


•  Some  ISPs/AS  willing  to 
trade  behavioral  leniency  for 
compensation:  McColo 
Corp.  and  3FN 

•  Some  geographical 
jurisdictions  are  more  lenient 
than  others  (and  this  maps 
into  IP  space) 

• As IPs become BL'ed, operations must shift to 'fresh' addresses,
likely those from the same allocation (i.e., subnets)
PreSTA: SPAM USAGE
VALUATION WORKFLOW
REPUTATION ALGORITHM


• To calculate reputation for entity a:

Old blacklists -> SELECT ROWS MAPPING TO a -> BL(a)

$$\mathrm{raw\_rep}(a) = \sum_{i=1}^{|BL(a)|} \frac{\mathrm{time\_decay}(BL(a)_i)}{\mathrm{magnitude}(a)}$$

$$\mathrm{REP}(a) = 1.0 - \left(\mathrm{raw\_rep}(a) \cdot \phi^{-1}\right)$$


time_decay(*): Returns on [0,1], higher weight to more recent events

magnitude(a): Number of IPs in grouping a

φ: Normalization constant putting REP() on [0,1]
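
A minimal sketch of the calculation above, under stated assumptions: time_decay is modeled as an exponential half-life (the slide only requires a value on [0,1] that favors recent events), the normalization constant is passed in by the caller as phi, and the result is clamped to [0,1]. The half-life value, the function signatures, and the clamp are illustrative choices, not taken from the slides.

```python
from datetime import datetime, timedelta

HALF_LIFE_DAYS = 10.0  # assumed value; the slides give no decay rate


def time_decay(listed_at, now, half_life_days=HALF_LIFE_DAYS):
    """Weight on [0,1]: 1.0 for a listing made right now, halving every half-life."""
    age_days = max((now - listed_at).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / half_life_days)


def raw_rep(listings, magnitude, now):
    """Sum of decayed blacklist events for a grouping, divided by its size in IPs."""
    return sum(time_decay(t, now) for t in listings) / magnitude


def rep(listings, magnitude, now, phi):
    """REP = 1.0 - raw_rep / phi, clamped to [0,1] (the clamp is an assumption)."""
    value = 1.0 - raw_rep(listings, magnitude, now) / phi
    return min(1.0, max(0.0, value))


if __name__ == "__main__":
    now = datetime(2009, 11, 4)
    history = [now, now - timedelta(days=10)]  # BL(a): two past listings
    print(rep(history, magnitude=1, now=now, phi=3.0))
```

An entity with no blacklist history gets a reputation of 1.0, and each recent listing pulls the value toward 0. For a block or AS grouping, magnitude is the number of IPs in the group, so a single bad IP affects a large group only slightly.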
SVM LEARNING


•  Combination  strategies 

•  Support  Vector  Machine 

Supervised  learning 

Train  over  previous  email 
to  classify  current  emails 

•  Draws  surface  (threshold) 
best  separating  points 

Can  adjust  penalty  weight 
to  keep  false  positives  low 

Polynomial,  RBF  kernels 
improve  on  linear 
performance 


[Figure legend: Ham Mails (10k), Spam Mails (10k)]
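
To illustrate how a penalty weight can bias the separating surface against false positives, here is a small linear SVM trained by stochastic subgradient descent on the hinge loss. It is a sketch, not the classifier used in the talk: the feature layout (IP, block, and AS reputations), the hyperparameters, and the function names are assumptions, and the kernel variants mentioned above are not covered.

```python
def train_linear_svm(samples, labels, fp_penalty=1.0, lam=0.01,
                     epochs=500, lr=0.05):
    """Fit w, b for labels +1 (spam) / -1 (ham) with a weighted hinge loss.

    fp_penalty > 1 makes margin violations on ham more costly, which
    pushes the surface away from ham and lowers false positives.
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            cost = fp_penalty if y == -1 else 1.0
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # L2 regularization step: shrink the weights slightly
            w = [wi * (1.0 - lr * lam) for wi in w]
            if margin < 1.0:
                # hinge-loss subgradient step for a margin violation
                w = [wi + lr * cost * y * xi for wi, xi in zip(w, x)]
                b += lr * cost * y
    return w, b


def classify(w, b, x):
    """Return +1 (spam) or -1 (ham) for one feature vector."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


if __name__ == "__main__":
    # Toy feature vectors: (IP reputation, block reputation, AS reputation)
    spam = [[0.1, 0.2, 0.3], [0.2, 0.1, 0.2], [0.3, 0.3, 0.1]]
    ham = [[0.9, 0.8, 0.9], [0.8, 0.9, 0.7], [0.7, 0.9, 0.9]]
    w, b = train_linear_svm(spam + ham, [1] * 3 + [-1] * 3)
    print(classify(w, b, [0.15, 0.15, 0.15]), classify(w, b, [0.85, 0.85, 0.85]))
```

In practice an established SVM library would replace this loop; the point here is only the asymmetric cost on ham examples.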
SPAM: TESTING DATASETS
SPAM: PERFORMANCE (1)

Captures up to 50% of spam not
caught by traditional blacklists
with the same low false-positive rate


We capture between 20-50%
of spam that gets past
current blacklists

By design our FP-rate
is equivalent to BLs:
~0.4%

Total blockage remains
near constant: 90%

Blacklists are reactive,
we are predictive. We
can cover their slack

Cat and mouse. Graph
should roll over time
SPAM: PERFORMANCE (2)


[Figure: IP-204.xxx.9.154 History, Sept. 9 - Oct. 3, 2009. Series: BL, SPAMS (y1), REP (y2)]


Left: Temporal (single IP) example
where our metric could mitigate
spam reception


Right: Probable botnet attack which our
metric could mitigate via both
temporal and spatial means
SPAM: CONTRIBUTIONS


SNARE [3] (GA-Tech)


• Supervised learning across 13 network-level features, including spatio-temporal ones

• Don't need blacklists (but neither do we, only known spamming IPs)


Existing ‘Reputation Systems’ [6]


•  Exclusive  use  of  negative  feedback 

•  Existing  email  reputation  systems  [5]  focus  only  on  sharing  classifications 


DISTINGUISHING  CONTRIBUTIONS 


•  Formalization  of  predictive  spatio-temporal  reputation 

•  Development  of  a  lightweight  mail  filter,  capable  of  500k+  mails/hour 
FUTURE: WIKIPEDIA


PURPOSE:  Build  a  blacklist  of  user-names/IPs 
based  on  the  probability  they  will  vandalize 


TEMPORAL 


Straightforward,  vandals  are  probably  repeat  offenders 

Registered users have IDs indicating when they joined. Are
new users more likely to vandalize?


SPATIAL 


Geographical: Based on user location (e.g., Wash. D.C.)

Topical:  A  user  may  vandalize  one  topic  (Rush  Limbaugh), 
while  properly  editing  another  (Barack  Obama) 

Anonymous  users:  IP  address  properties 


Certain  administrators  have  rollback  (revert)  privileges 
Comment:  “Reverted  edit  by  X  to  last  edition  by  Y” 
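
Revert comments of this kind could serve as the negative feedback PreSTA needs. The sketch below assumes comments follow exactly the wording quoted above; the regular expression, the function names, and the sample comments are hypothetical, and real edit summaries may be phrased differently.

```python
import re
from collections import Counter

# Assumed comment format, copied from the slide's example wording
REVERT_RE = re.compile(
    r"^Reverted edit by (?P<offender>.+?) to last edition by (?P<restored>.+)$"
)


def offender_from_comment(comment):
    """Return the reverted user (name or IP), or None if not a revert comment."""
    match = REVERT_RE.match(comment.strip())
    return match.group("offender") if match else None


def count_reverts(comments):
    """Tally how often each user was reverted: a crude behavior history."""
    offenders = (offender_from_comment(c) for c in comments)
    return Counter(o for o in offenders if o is not None)


if __name__ == "__main__":
    log = [
        "Reverted edit by 203.0.113.7 to last edition by Alice",
        "Fixed typo",
        "Reverted edit by 203.0.113.7 to last edition by Bob",
    ]
    print(count_reverts(log))
```

Pairing each extracted offender with the revert's timestamp would give the same (principal, time) feedback record that a blacklist provides in the spam case.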
FUTURE: QUANTM [2] MODEL


PreSTA  may  trivially  fulfill  the  reputation 
component  of  qualifying  QTM  systems 

TDG-like  hierarchy  of  IP-delegation 

Spatial  groups  from  credential  depth? 

General use-case criteria:

(1) There must be a grouping function
to define finite sets of participants

(2) Observable and dynamic feedback
sufficient to construct behavior history
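
The two criteria can be read as an interface: a grouping function that maps each participant to a finite group, and a stream of timestamped feedback. The sketch below shows how those two pieces combine into per-group behavior histories; the names and the example grouping are assumptions for illustration only.

```python
from collections import defaultdict


def group_histories(feedback, grouping):
    """Build a behavior history per group.

    feedback: iterable of (principal, timestamp) negative observations
    grouping: function mapping a principal to its group key
    """
    histories = defaultdict(list)
    for principal, timestamp in feedback:
        histories[grouping(principal)].append(timestamp)
    return dict(histories)


if __name__ == "__main__":
    # Example grouping: the first three octets of a dotted IPv4 address
    events = [("10.0.0.1", 1), ("10.0.0.9", 2), ("10.0.1.5", 3)]
    print(group_histories(events, lambda ip: ip.rsplit(".", 1)[0]))
```

Swapping the grouping function (AS number, article topic, credential depth) changes the spatial dimension without touching the history logic.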
CONCLUSIONS


Given a known set of malicious users
(and the time at which they misbehaved)...


additional malicious users may be identified using:


(1) Temporal histories of principals (2) w.r.t. the space in which they reside


... and such a system is useful for:

(1) Lightweight spam
filtering above
traditional blacklists
(DONE)

(2) Detecting editors
likely to vandalize
Wikipedia
(IN PROGRESS)

(3) Fulfilling the
reputation component of
any QTM system
(FUTURE WORK)